Add FP4 tile + vector types and memory operations by ccs1112 · Pull Request #54 · HazyResearch/HipKittens

ccs1112 · 2026-04-19T04:59:46Z

This PR closes #46 and #47. It lands fp4e2m1_2 as the canonical sub-byte tile dtype and adds the three load paths from #47.

FP4 breaks the sizeof(scalar) × packing::num() == sizeof(packed) invariant, so shared storage has to choose between a packed and an unpacked representation. Unpacked storage halves the effective shared-memory budget and forces ALU nibble-packing on every shared → register load. A new st<fp4e2m1, ...> specialization would be perf-optimal but rewrites a core struct for one dtype. Using fp4e2m1_2 as the dtype, an approach I picked up from ThunderKittens, gives the same performance and costs only one relaxed static_assert, so I took this route.
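To make the broken invariant concrete, here is a minimal host-side sketch. The types `bf16_t`, `bf16_2_t`, `fp4_t` and `fp4_2_t` and the helper `storage_invariant_holds` are hypothetical stand-ins chosen for illustration, not the HipKittens definitions; the only assumption that matters is that a lone FP4 scalar cannot occupy less than one addressable byte.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins (assumptions for illustration, not HipKittens types).
struct bf16_t   { std::uint16_t bits; };   // one bf16 scalar: 2 bytes
struct bf16_2_t { bf16_t x, y; };          // two bf16 scalars: 4 bytes
struct fp4_t    { std::uint8_t bits; };    // one FP4 scalar still needs a whole byte
struct fp4_2_t  { std::uint8_t bits; };    // two FP4 values share one byte

// The invariant from the PR text: sizeof(scalar) * num == sizeof(packed).
template <typename Scalar, typename Packed>
constexpr bool storage_invariant_holds(std::size_t num) {
    return sizeof(Scalar) * num == sizeof(Packed);
}

// bf16: 2 bytes * 2 == 4 bytes, so the invariant holds.
static_assert(storage_invariant_holds<bf16_t, bf16_2_t>(2), "bf16 satisfies the invariant");

// FP4: 1 byte * 2 != 1 byte, so a scalar FP4 dtype breaks it.
static_assert(!storage_invariant_holds<fp4_t, fp4_2_t>(2), "scalar FP4 breaks the invariant");

// Bytes needed to hold `values` FP4 numbers under each storage choice.
constexpr std::size_t unpacked_bytes(std::size_t values) { return values * sizeof(fp4_t); }
constexpr std::size_t packed_bytes(std::size_t values)   { return (values / 2) * sizeof(fp4_2_t); }

// Unpacked storage costs twice the bytes of packed storage.
static_assert(unpacked_bytes(128) == 2 * packed_bytes(128), "unpacked doubles the footprint");
```

Treating the pair as the storage unit sidesteps the problem because every element is again a whole number of bytes.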

rt_shape is type-agnostic, so num() has to count unpacked-type units per packed-type element: 2 for FP4. The physical register layout is unchanged. Type parameters can be added to rt_shape as a follow-up if desired.
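A rough sketch of how the num() counting could look follows. The struct bodies and byte sizes here are assumptions made for illustration (one byte for a pair, two bytes for a quad); the real HipKittens types wrap HIP vendor types, and the real `packing` trait may carry more members than shown.

```cpp
#include <cstdint>

// Hypothetical storage types; sizes are assumptions for this sketch.
struct fp4e2m1_2 { std::uint8_t  bits; };  // 2 FP4 values in 1 byte
struct fp4e2m1_4 { std::uint16_t bits; };  // 4 FP4 values in 2 bytes

template <typename T> struct packing;      // primary template left undefined

// The pair is the unpacked unit: it refers to itself and counts as 1.
template <> struct packing<fp4e2m1_2> {
    using unpacked_type = fp4e2m1_2;
    static constexpr int num() { return 1; }
};

// The quad holds 2 pair units (not 4 scalars).
template <> struct packing<fp4e2m1_4> {
    using unpacked_type = fp4e2m1_2;
    static constexpr int num() { return 2; }
};

// With the pair as the unit, the size invariant holds again:
// sizeof(unpacked) * num() == sizeof(packed).
static_assert(sizeof(packing<fp4e2m1_4>::unpacked_type) * packing<fp4e2m1_4>::num()
                  == sizeof(fp4e2m1_4),
              "pair-granularity invariant");
```

Counting in pair units keeps num() at 2, the same value the two-wide packed types report, so type-agnostic shape arithmetic does not need an FP4 special case.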

The crux is the new fp4e2m1_4 branch in shared_to_register.cuh. It is essentially the fp8e4m3_4 branch copied over with the new types and aliases, plus some compile-time assertions. I don't think global_to_shared.cuh needs any changes, because the existing byte math already handles sizeof(dtype)==1 as long as the columns are counted in pairs.
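The claim about global_to_shared.cuh can be illustrated with a small sketch. The names `fp8_elem`, `fp4x2_elem`, `row_bytes` and `fp4_pair_cols` are hypothetical, and the actual copy code is not reproduced here; the point is only that size-based arithmetic cannot tell a one-byte FP4 pair from a one-byte FP8 value.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins; only their sizes matter for this sketch.
struct fp8_elem   { std::uint8_t bits; };  // one FP8 value per byte
struct fp4x2_elem { std::uint8_t bits; };  // two FP4 values per byte

// Bytes in one tile row that holds `cols` storage elements of type T.
template <typename T>
constexpr std::size_t row_bytes(std::size_t cols) { return cols * sizeof(T); }

// Storage columns needed for `logical_cols` FP4 values (must be even).
constexpr std::size_t fp4_pair_cols(std::size_t logical_cols) { return logical_cols / 2; }

// 128 logical FP4 values per row become 64 one-byte pair elements.
static_assert(row_bytes<fp4x2_elem>(fp4_pair_cols(128)) == 64, "64 bytes per row");

// For the same storage column count, the byte math matches FP8 exactly.
static_assert(row_bytes<fp4x2_elem>(64) == row_bytes<fp8_elem>(64), "same as 1-byte fp8");
```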

I wrote a test file, with some caveats:

- GPU_TARGET=CDNA4 make -j13 compiles clean on MI300X (Hot Aisle, rocm/7.0-preview container).
- I don't have MI35x access, so if anyone is able to run the unit tests for this branch on that device it would be greatly appreciated. I'm waiting to hear back from the AMD dev program on credits and device access.
- Main branch fails to build with GPU_TARGET=CDNA3. I know main targets CDNA4, so if CDNA3 support has been dropped then this isn't a bug; otherwise the fix can be filed separately. The error is: Cannot select: intrinsic llvm.amdgcn.raw.buffer.load.lds

Introduce a packing<fp4e2m1_2> specialization (num()=1, self-referential unpacked_type) and change packing<fp4e2m1_4>'s num() from 4 to 2 with unpacked_type = fp4e2m1_2. HipKittens' rt_shape is type-agnostic, so num() must count unpacked-type units per packed-type element for the packed_per_thread arithmetic to follow the same pattern as bf16/half/fp8. Update the T1 concept to include fp4e2m1_2 in place of fp4e2m1. Allow fp4e2m1_4 in rt_base's dtype allowlist. Relax st.cuh's num()==1 assertion to allow fp4e2m1_2, and reject the scalar fp4e2m1 in st and gl with messages pointing users to the packed types. Add sv_fp4e2m1_2 and rt_fp4e2m1_2 aliases and fp4e2m1_2 <-> float2 converters.
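For readers unfamiliar with the format, the following is a minimal host-side sketch of what a pair <-> float2 conversion involves. It is not the PR's converter: `float2_t`, `fp4x2_t`, the function names, the low-nibble-first ordering and the tie-breaking rule are all assumptions made for illustration. The magnitude table is the standard E2M1 value set (1 sign bit, 2 exponent bits, 1 mantissa bit).

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical stand-ins for float2 and the packed pair type.
struct float2_t { float x, y; };
struct fp4x2_t  { std::uint8_t bits; };

// The 8 non-negative E2M1 magnitudes, indexed by the low 3 bits of a nibble.
constexpr float kE2M1Magnitudes[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Decode one 4-bit code: bit 3 is the sign, bits 0..2 select the magnitude.
inline float decode_e2m1(std::uint8_t nibble) {
    const float mag = kE2M1Magnitudes[nibble & 0x7];
    return (nibble & 0x8) ? -mag : mag;
}

// Encode by picking the nearest grid magnitude. Ties resolve to the smaller
// magnitude here, which is a simplification (not round-to-nearest-even).
// Values beyond 6 saturate to 6.
inline std::uint8_t encode_e2m1(float v) {
    const std::uint8_t sign = std::signbit(v) ? 0x8 : 0x0;
    const float mag = std::fabs(v);
    std::uint8_t best = 0;
    for (std::uint8_t i = 1; i < 8; ++i) {
        if (std::fabs(mag - kE2M1Magnitudes[i]) < std::fabs(mag - kE2M1Magnitudes[best])) {
            best = i;
        }
    }
    return static_cast<std::uint8_t>(sign | best);
}

// Assumed layout: x in the low nibble, y in the high nibble.
inline fp4x2_t to_fp4x2(float2_t v) {
    return fp4x2_t{static_cast<std::uint8_t>(encode_e2m1(v.x) | (encode_e2m1(v.y) << 4))};
}

inline float2_t to_float2(fp4x2_t p) {
    return float2_t{decode_e2m1(p.bits & 0xF), decode_e2m1(p.bits >> 4)};
}
```

Values that already lie on the grid round-trip exactly, for example (1.5, -3.0); anything else snaps to the nearest of the 16 codes.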

- shared_to_register.cuh: new fp4e2m1_4 branch in the row-layout load at RT::base_tile_stride == 16 (rt_16x128), mirroring the existing fp8e4m3_4 branch in width (ds_read_b128 with float4 cast). Handles both ST>=RT and ST<=RT base-tile configurations.
- global_to_register.cuh: add fp4e2m1_2 to the existing fp8e4m3 rejection in all three static_asserts (row-load, col-load, row-store). Direct g->r isn't used for packed sub-byte types.
- shared_to_register.cuh (guards): also reject fp4e2m1_2 in the register->shared store path (not in PR 1's scope), and reject the scalar fp4e2m1 in the load path to backstop the higher-level guards in st.cuh and gl.cuh.

- testing_utils.cuh: pair-granularity initialize and validate branches for fp4e2m1_2. i_ref/o_ref are sized in pair units; each pair is packed as (f, f) so both halves dequantize to the same value. Tolerance sized to FP4's 16-point grid (absolute 0.5).
- fp4_load.cu (new): hand-rolled kernel that loads packed FP4 global -> shared -> register and dumps each thread's 32 fp4e2m1_4 elements as 128 floats. Host sorts both sides and does a multiset comparison with FP4-grid tolerance. Exercises the new shared_to_register fp4e2m1_4 branch that the existing sharedreg_load_store round-trip can't reach (register -> shared FP4 store isn't in HazyResearch#47's scope).
- global_to_shared.cu: extend the type sweep with fp4e2m1_2, exercising the g->s and s->g paths for packed FP4 at every supported shape.
- testing_flags.cuh: wire TEST_WARP_MEMORY_TILE_FP4_LOAD into the TEST_ALL_WARP_MEMORY_TILE expansion and the TEST_WARP_MEMORY_TILE derivation.
- tile.cu/cuh: dispatch the new test.
- Makefile: add the flag to the default build.
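The host-side check described for fp4_load.cu can be sketched roughly as follows. The function name `multiset_match` and its signature are hypothetical, and this is not the test's actual code; it only shows the sort-then-compare idea with the absolute tolerance of 0.5 mentioned above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Order-insensitive comparison: sort both sides, then require each pair of
// values at the same rank to agree within an absolute tolerance.
// Vectors are taken by value so the caller's data is left untouched.
inline bool multiset_match(std::vector<float> got, std::vector<float> want,
                           float abs_tol = 0.5f) {
    if (got.size() != want.size()) {
        return false;  // different element counts can never match
    }
    std::sort(got.begin(), got.end());
    std::sort(want.begin(), want.end());
    for (std::size_t i = 0; i < got.size(); ++i) {
        if (std::fabs(got[i] - want[i]) > abs_tol) {
            return false;
        }
    }
    return true;
}
```

Because both sides are sorted first, a check like this confirms that the right values arrived but says nothing about which thread or register slot each value landed in.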

willhu-jpg · 2026-04-29T22:04:11Z

Thanks! Can we actually work on FP4 in a branch together? I think it'll keep the main branch more stable as we go about things.

ccs1112 · 2026-04-30T04:04:46Z

Sounds good, looks like there's no existing FP4 branch on this upstream. Do you want to make one and I'll retarget the PR? I don't have write access on HazyResearch:main so I would only be able to make a branch on my fork, which would kind of defeat the purpose.

ccs1112 added 3 commits April 18, 2026 13:51

ccs1112 changed the title ~~Add FP4 tile + vector types and memory operations (#46, #47)~~ Add FP4 tile + vector types and memory operations Apr 19, 2026

ccs1112 wants to merge 3 commits into HazyResearch:main from ccs1112:fp4-p1-types-and-memory
